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Abstract 

We address the problem of estimating the sensitivity of 
seed-based similarity search algorithms. In contrast to ap- 
proaches based on Markov models 47&l l6l01l4ll7CT. we study 
the estimation based on homogeneous alignments. We de- 
scribe an algorithm for counting and random generation of 
those alignments and an algorithm for exact computation of 
the sensitivity for a broad class of seed strategies. We pro- 
vide experimental results demonstrating a bias introduced 
by ignoring the homogeneousness condition. 



1. Introduction 

Comparing nucleic acid or protein sequences remains by 
far the most common bioinformatics application. The clas- 
sical local alignment problem consists in computing most 
significant similarities between two sequences, or between 
a sequence and a database. The significance of an alignment 
is measured by a score, commonly defined using an addi- 
tive principle by assigning a positive score to each match- 
ing character, and a negative score (penalty) to each mis- 
match and each contiguous gap. 

Best-scoring local alignments can be computed by the 
well-known Smith- Waterman dynamic programming algo- 
rithm 1231 . however for large-scale sequence comparison 
this computation becomes too time-consuming. Sev- 
eral heuristic algorithms have been designed to speed up 
the computation of local alignments, at the price of pos- 
sibly missing some alignments or computing their 
sub-optimal variants. BLAST Q] is the most promi- 
nent representative of this family, Fasta |17J| is another 
example. Both these programs are based on the com- 
mon principle: similarity regions are assumed to share one 
or several short fragments, called seeds, that are used to de- 
tect potential similarities. More recently, it has been un- 
derstood that using non-contiguous (called also spaced or 



gapped) seeds instead of contiguous substrings can con- 
siderably improve the sensitivity/selectivity trade-off. 
PatternHunter [18J was the first method that used care- 
fully designed spaced seeds to improve the sensitivity of 
DNA local alignment. Spaced seeds have been also shown 
to improve the efficiency of lossless filtration for approxi- 
mate pattern matching 1201 171. Earlier, random spaced seeds 
were used in FLASH program |8| to cover sequence sim- 
ilarities, and the sensitivity of this approach was recently 
studied in |5|. For the last two years, spaced seeds re- 
ceived an increasing interest ^ |3] 0] [TO] |51 and have been 
used in new local alignment tools II22II19I . 

Coming back to the concept of alignment score, note that 
aU heuristic algorithms typically try to output alignments 
with a score greater than some user-defined bound. How- 
ever, computing all such alignments would be an ill-defined 
task. Usually, the alignments of interest are those which, on 
the one hand, do not contain in them big regions of neg- 
ative score (in which case the alignment should probably 
be split into two or more alignments of higher score) and on 
the other hand, are not too short to be a part of a larger high- 
scoring similarity. This is captured by the Xdrop heuristics, 
a part of BLAST algorithm: once a seed has been identi- 
fied, the Xdrop algorithm extends it in both directions into 
so-called High-scoring Segment Pair (HSP), as long as the 
"running score" does not decrease more than by a certain 
value. All other seed-based algorithms apply a similar ap- 
proach in that they extend the found seed (or group of seeds) 
outside as far as the total score does not undergo a pro- 
hibitive drop. 

The alignments found in such a way are formalized 
through the notion of maximal scoring segment"^ |21 1. Con- 
sider a gapless alignment which can be naturally translated 
into a sequence of match/mismatch scores. We call this 
alignment (or sequence) homogeneous if it does not contain 



1 We use the term segment instead of subsequence f21j as the latter usu- 
ally does not require the elements to be contiguous. 



a proper contiguous subalignment (segment) whose score is 
greater than that of the whole ahgnment. Given an ahgn- 
ment (or sequence), a homogeneous subaHgnment is called 
a maximal scoring subalignment (segment) if it is extended 
to the right and left as far as the homogeneity property is 
verified. In other words, a maximal scoring segment is not 
included in a larger homogeneous segment. 

Abstracting from possible gaps, alignments found by 
similarity search programs are maximal scoring segments. 
On the other hand, maximal scoring segments capture a bio- 
logically relevant notion of alignment: if an alignment is not 
homogeneous then it is likely to be a merge of two align- 
ments that should be considered as distinct, and if an align- 
ment is not maximal, it is likely to be a part of a larger in- 
teresting alignment. 

In this context, the main motivation of this work can be 
summarized by the following claim: Since homogeneous 
alignments are those which are really found and intended 
to be found by similarity search algorithms, then the effi- 
ciency of those algorithms has to be measured on homoge- 
neous alignments rather than on arbitrary alignments. 

With this motivation in mind, we propose an approach to 
analyze the sensitivity of similarity search algorithms. Us- 
ing this approach, we demonstrate that measuring the sensi- 
tivity of algorithms on general alignments instead of homo- 
geneous alignments (as it is usually done) leads to biased 
results, more specifically to an underestimation of the sen- 
sitivity. 

In this paper, we propose a dynamic programming algo- 
rithm to compute the sensitivity estimator (hit probability) 
with respect to homogeneous alignments. The algorithm, 
which is an extension of algorithms proposed in 1 15 6 3 1, 
works for a wide range of seed definitions (contiguous or 
spaced seeds, single- double- or multiple-seed approaches, 
etc) and therefore can guide the choice of the seed strat- 
egy. It is based on the enumeration of homogeneous align- 
ments. On the other hand, the enumeration allows us to ob- 
tain an efficient random generation algorithm for homoge- 
neous alignments. The latter, in turn, can be used for esti- 
mating the hit probability experimentally by testing the cho- 
sen seed criterion on a large number of random homoge- 
neous alignments. 

Finally, note that the homogeneous sequences have 
other applications in biological sequence analysis. Kar- 
lin, Altschul and other authors fl4', '131 studied long 
homogeneous (and maximal scoring) segments in pro- 
tein sequences, where each residue is assigned a score 
according to a certain scoring function. It has been demon- 
strated that those segments are often biologically signifi- 
cant, and may, for example, correspond to transmembrane 
domains 1 13|. Therefore, the methods proposed in our pa- 
per can potentially apply to other bioinformatics problems 
than the problem of measuring the sensitivity of local aUgn- 



ments programs considered here. 

The paper is organized as follows. Section |2] presents 
algorithms, based on enumeration techniques, for the uni- 
form random generation of homogeneous sequences. Sec- 
tion|3describes an exact algorithm to compute the seed de- 
tection probability on homogeneous sequences. Sectionl^is 
devoted to experiments, and demonstrates the bias induced 
by considering general sequences rather than homogeneous 
ones. Finally, Section |5] discusses possible extensions and 
directions for future work. 

2. Enumeration and random generation of 
homogeneous sequences 

Our main object of study is the gapless alignment of 
two DNA sequences. We represent it by a binary sequence 
A = . . . , hn), bi G {0, 1}, where 1 stands for a match 
and for a mismatch. We assume that each match is as- 
signed a constant positive integer score s, and each mis- 
match a constant negative integer score (penalty) —p, re- 
gardless of the mismatching letters. This is the case for 
many nucleotide scoring systems, for example a popular 
BLAST default scoring system assigns 1 to each match and 
-3 to each mismatch.^ An alternative representation of A 
is then the sequence of individual scores, called score se- 
quence, Xa — (2^1, . . . , Xi e {s, — p} and Xi = 
—p + bi{s + p), and the score of the whole alignment is 

six)^rUx,. 

Definition 1. A binary sequence A (and the associated 
score sequence Xa) is called homogeneous if S{Xa) is 
strictly greater than S{XA[i--j]) for all proper segments 

XA[i.-j] = {xi, . . . ,Xj) (i > I or j < n). 

This section is devoted to the following question: How 
to uniformly generate random homogeneous sequences of a 
given length n, or of given length n and score 5? Here the 
uniformity condition means that each sequence of the con- 
sidered class has the same probability to occur. 

The interest of this question is twofold. First, being able 
to randomly generate homogeneous sequences would allow 
to measure the sensitivity of a similarity search algorithm in 
the experimental way, by testing it against a sample of ran- 
domly drawn homogeneous sequences. Second, as it is of- 



2 More accurate scoring systems assign different penalties to different 
mismatclies. In particular, it is very useful to distinguish between tran- 
sitions (substitutions A ^ G and C <-> T) and transversions (other 
substitutions), since transition mutations occur with a greater rela- 
tive frequency than transversions, particularly in coding sequences. 
This distinction is also useful in seed design, as it allows to consider 
transition-constrained mismatches, as done by BLASTZ 1 22 1 or YASS 
1 19 1 for example. A further step would be to make finer partitions of 
all nucleotide pairs, depending on statistical properties of analyzed se- 
quences. In our setting, this would imply the modeling of alignments 
by sequences over three or more letters. We will discuss this exten- 
sion in Sectionlsl 



ten the case for combinatorial objects, the question of ran- 
dom generation of homogeneous sequences is closely re- 
lated to the question of their enumeration, and the latter will 
be used in the next section to obtain an exact algorithm for 
computing the sensitivity. 

For a binary sequence A — (61, . . . , 6„) and the associ- 
ated score sequence Xa = (xi, . . . , a;„), consider the evo- 
lution of the prefix score foi" ^ = l-.n. The evolu- 
tion can be represented as a walk on 7? starting from the ori- 
gin (0, 0) and evolving through two possible vectors (1, s) 
and (1, —p). The one-to-one relation between a binary se- 




1:1 0:1:1 1:0:1 1 



Figure 1 : An alignment uniquely associated with a walk. 1 
stands for a match and for a mismatch 



quence and a walk is illustrated in FigureQ] 

Homogeneous sequences correspond to walks C on 7? 
starting at (0, 0), ending at (n, S), and verifying two addi- 
tional conditions: 

• C is positive: V{k, y) G C, k > ^ y > 0, 

• C is culminating: \/k < n,{k,y) G C ^ y < S. 

A walk verifying both conditions is called a culminating 
positive walk (CPW). It is easily seen that the condition for 
a walk to be both positive and culminating is equivalent to 
the homogeneity of the underlying sequence. 

We will be interested in two cases, depending on whether 
the total score (culminating point) S is fixed or not. We start 
with the case of sequences of arbitrary score S and then 
show how the algorithm is modified to the case of fixed 
score. Let C„ be the set of all CPWs of size n and arbi- 
trary total score S. 

A classical approach to the random generation of se- 
quences of a given length drawn from a language L is based 
on counting suffixes of those sequences |24|. It allows to 
generate sequences incrementally from left to right. In our 
case, this approach is preferable to the generation by rejec- 
tion. The latter consists in generating sequences uniformly 



among all possible sequences and discarding those that do 
not meet the constraints. Although this method also yields 
a uniform distribution, and generating each candidate se- 
quence can be done in linear time, the time complexity of 
the rejection method heavily depends on s and p parameters. 
It would be efficient if the constraints are not too strong - 
e.g. when the probability of rejection is 0(l/n). In our set- 
ting, if s is smaller than p (which is often the case in prac- 
tice), the rejection probability tends to 1 with an exponen- 
tial speed as n grows, thus leading to an expected exponen- 
tial time complexity of generation. 

The counting approach is based on the following gen- 
eral scheme. Let L be a set of sequences over an alphabet 
S = {ai, . . . , a„i} and L„ be the sequences of L of size n. 
Let Wp be a prefix of some sequence of L„. We call Pa{wp) 
the probability that Wp can be followed by a G S to form a 
sequence of L„: 



Pa{Wp) 



|{ 



w Wr,aw 



\{w"\wpw" G L„}| 



(1) 



Lemma 2. Given values Pa {wp) for all a G A and all pre- 
fixes Wp, one can generate sequences of Ln uniformly. 

Proof. Starting from e, issue consecutively let- 
ters Q;i,...,a„ with probabilities Pai{e), Pa2{cti), 
Pq,, (aiQ;2), PQ„(aiQ;2 ••■ a„_i). The probabil- 
ity of issuing a sequence w = aia2a^ ■ ■ ■ ctn is 

P(qi . . . a„) = 

Pai{£)Pa2{otl)Pa3{oilOl2) . . . Pa„ («! 02 SQ;„_1 ) = 

\{w \aiw ^Lji}\ \{w I (12 GI'Ti } I I cki ■ - -ckti ^ £^tz } I 

\Ln\ \ oi\w'^Ln } I ' ' ' I {iij' I CKi - ■ -CK^_i lij'GIjTi } 



1 



(2) 



as {w'|q!i . . . Unw' G Ln} = {e}. Therefore, this yields a 
uniform generation procedure. □ 




Figure 2: Suffixes B associated to a prefix A of Cn depend 
only of the maximal ordinate h and current ordinate y 



In general, one has to precompute up to values 
Pa{wp). However, in the case of homogeneous sequences. 



it is not necessary to process each prefix Wp individually, 
as the only relevant information of Wp is the maximal or- 
dinate h reached by Wp and the current ordinate y (see 
Figure |2j- Therefore, we introduce the concept of {h,y)- 
initialized walk. 

Definition 3. For h,y > 0, y < h, an {h, y) -initialized 
CPW is a CPW starting at (0, y) and culminating at some 
point {n, S), such that S > h. 

Let Cy,h,n be the set of {h, y)-initialized CPWs of length 



Lemma 4. Assume that Wp is a positive walk from (0,0) 
to {k, y) and h is its maximal ordinate. Then {w'\wpw' £ 

Cn} — ^y,h,7i—\wp\- 

A proof is immediate and is illustrated on Figure^] 
To count the (/i, y) -initialized CPWs of size n for 
aU compatible values of h,y and n, we use the follow- 
ing recursive decomposition of [h, ?;)-initialized CPWs. 
A CPW is represented below as a sequence of vec- 
tors {(1, s), (1, -p)}. 



Lemma 5. For y,h> 0, 



y,h,i 



(l,s) ify + s>h 
otherwise 



(3) 



and for k > 1, 



(1, s) • C'y-|-s^m.ax(/i,y+s),A; — 1 

Cv,h,k ^ { U (1: -p) • Cy-p,h,k-i ify > p, 

(1, s) ■ Cy+s,max{h,y+s),k-i Otherwise. 



(4) 



Proof. A one-step CPW cannot be a (1, — p)-step, as it 
would not be culminating. It can only be a (1, s)-step pro- 
vided that h < y + s. 

The general case is a union of two cases, depending on 
whether the first step is (1, s) or (1, —p). The latter is pos- 
sible only if y > p. If the first step is (1, s), then the new 
maximum is set to max(/i, y + s). □ 

Note that the decomposition of Cyji,n is unambiguous, 
which means that the union operation is a disjoint union. 
Therefore, Lemma [S] gives a recursive formula for com- 
puting the number of CPWs Cy^h,k for different values 
l<fc<n, 0<y, /i<s-n. A dynamic programming 
implementation of the recursion leads to an 0{rr' ) time and 
space complexity. 

Lemma0]allows to count the number of possible suffixes 



for any prefix Wp of a walk of C„ . Let k — n~ 



, y be the 



ordinate of the final point of Wp and h the maximal ordinate 
reached by Wp. Then, the probability Pi (wp) that Wp is fol- 
lowed by a step (1, s) is \Cy+s,mB,yi{h,y+s)M-l\/\Cy,h,k\- 

soon as the values \Cy^h,k\ are computed, a sequence of C„ 
is generated incrementally in time 0{n) using Lemma|2] 



We now modify the method to generate homogeneous 
sequences of fixed score S, which amounts to generating 
CPWs with a fixed cumulating point. This case is simpler, as 
there is only a finite number of possible intermediate scores 
(states). Therefore the set of sequences becomes a regular 
language, for which there exists a linear-time random gener- 
ation algorithm (including the preprocessing time) 112111 II . 
In our case, it is sufficient to precompute an S* x n table stor- 
ing the number of CPW suffixes for each point of the rectan- 
gle specified by corner points (0, 0) and {n, S). Those can 
be seen as walks inside each rectangle with corners (0, y) 
and (fc, S). Let be the set of such walks. The follow- 
ing lemma establishes the corresponding recurrence. 

Lemma 6. For y >0, 



- 



ify + s = S 
otherwise 



(5) 



and for k > 1, 



(1,5) 

U {h~p)-D'' 



k-l 

,S 

y-p,k- 
— 1 
y-p,k-l 



ifp<y<S-s, 
if y <P < S — s 
ifp <S — s<y 
otherwise. 



(6) 



Again, the union is disjoint, and therefore the recurrence 
can be used for counting the cardinality of each f.. Us- 
ing Lemmas |2] and |4] this gives a uniform generation algo- 
rithm of 0{S ■ n) space and time complexity. 

3. Computing the hit probability 

We present now an algorithm for computing the hit prob- 
ability on a random homogeneous sequence, that can be ap- 
plied to different seed strategies such as single, double or 
multiple seeds, contiguous or spaced seeds, etc. The algo- 
rithm can be seen as an extension of the dynamic program- 
ming algorithm of [15 1 for computing the hit probability for 
a single seed, under the Bernoulli model of the sequence. 
The algorithm of fT5\ has been extended in several ways: 
proposed an extension to the (Hidden) Markov Models 
of the sequence; another technique, proposed in |6 1, allows 
to deal with Markov Models of the sequence and with multi- 
ple seeds; finally, in |4|, the algorithm of 1 15 1 was extended 
to another seed model, called vector seeds. A similar-style 
dynamic programming algorithm was proposed in fT^ in a 
purely combinatorial setting, for computing the so-called 
optimal threshold, which is the minimal number of seed oc- 
currences for given sequence length and number of substi- 
tution errors (see also |20|). 

The extension we propose here is of different nature, 
as our probabilistic space is not specified by a probabilis- 
tic model of the alignment, but by a set of possible align- 
ments and the condition of the uniform distribution. In other 



words, here we impose global constraints on the alignments 
(to be homogeneous and to have a given score) rather than 
specifying their local properties, as Markov models do. The 
key of the construction is the representation from the pre- 
vious section of those sequences as random culminating 
walks on the plane, together with counting formulas (|5}, 

For the sake of simplicity of presentation, we describe 
the algorithm for a single spaced seed. At the end of this 
section, we explain how the algorithm can be extended to 
multiple-seed strategies. 

Recall that a seed tt is a string over {0, 1}, where 1 stands 
for 'match' and for a don't care symbol. The length I of tt 
is called the span of tt and the number of I's in tt its weight. 
A seed tt matches a string u £ {0, 1} of length I, if for each 
position i, 7r[i] = 1 implies u[i] — 1. A seed tt detects a se- 
quence (gapless alignment) A £ {0,1}* if tt matches some 
substring of A. 

We now describe a dynamic programming algorithm for 
computing the exact probability that a given seed tt of span 
I, detects a random homogeneous sequence A of length n 
and score S under a scoring system (s, —p). 




sequence length 

Figure 3: Computing the seed detection probability on ho- 
mogeneous sequences 



Consider a prefix of a random homogeneous sequence 
A and assume that this prefix ends with a suffix M. That 
is, let A = BMC and let |B| = i and S{B) = y. Let 
P{i, M, y) be the probability that tt detects the prefix BM 
of a random sequence A (see Figure |3}. Thus, our goal is 
to compute the probability P{n, e, S) using a set of recur- 
sive equations that we define now. The following are initial 
conditions of the recursion: 

(i) P{i,M,y) = 0,ifi + |M| < I, 

(ii) P{i, M, y) = 1, if |M| = / and tt detects M. 

The following conditions insure that the probabilistic space 
is respected. Condition (iii) says that the sequences under 
consideration cannot have a negative score or a score greater 
than S. Condition (iv) is optional but allows to cut off at ear- 
lier stages some infertile branches of the computation. It in- 



sures that the walks are inside the diagonal band defined by 
extremal points (see Figure|3}- 

(iii) M, y) = 0, if 2/ > 5 and i < n, or y < and 

n > 0, 

(iv) P{i, M,y) = 0, if y > i ■ s or y < S — {n — i) ■ s. 
The following conditions describe main recursion steps. 

(v) if TT does not detect l'-l*"lAf6 (b £ {0,1}), then 

P{i,Mb,y)^P{t,M,y), 

(vi) if |M| < /, then P(i,M,y) = PiP{i - 1, l.M,y - 
s) + PoP{i~l, O.AI, y+p), where Pi and Pq are com- 
puted using formulas (|5}, (E): 

p \Ds-y+s,i^l\ p _ \Ds-v-p,i-l\ 



Condition (v) says that if Mb is not a suffix of any match of 
TT, then the last letter b can be dropped out. Condition (vi) is 
the most tricky one. It says that if M is shorter than I, then 
the probability is decomposed into the sum of two terms 
corresponding to two possible states of the walk right be- 
fore the start of M (Figure |3}- A way to compute the prob- 
ability po and pi of each of those states is to "flip over" 
the whole picture and to think of the walk as coming from 
point {n, S) to (0, 0). Then, Pq and Pi are computed by for- 
mula Q using the counting technique described in the pre- 
vious section. The walks that contribute to probabilities Pq 
and Pi are those located inside the shadowed zone in Fig- 
ureEl 

The recursive decomposition of P{n, e, S) goes as fol- 
lows: by applying (vi), the size of M increases up to length 
I, then by alternating (vi) and (v), the size of AI alternates 
between I and [l — 1) while i decreases unless conditions 
(i)-(iv) apply. 

The worst-case complexity of the algorithm is 0(2' •S'-n) 
both in time and space. The time complexity can be im- 
proved by introducing a preprocessing step and exploiting 
the structure of the seed tt, following a general method de- 
scribed in itTsl |3| . If w is the weight of the seed tt, the time 
complexity can be made 0{l ■ 2'^™ ■ {P + S ■ n)). We re- 
fer to |3 1 for details. 

The algorithm presented above can be extended to cer- 
tain multi-seed detection strategies, when a hit is defined as 
two or more proximate occurrences of the seed. A multi- 
seed strategy is used in Gapped BLAST |2| (two non- 
overlapping seed occurrences), BLAT |16| (two or more 
non-overlapping occurrences), PatternHunter fTs\ (two pos- 
sibly overlapping occurrences), YASS 1 19 1 (any number of 
possibly overlapping occurrences, with additional restric- 
tion on the overlap size). 

To extend the algorithm to the case of K seeds with- 
out constraints on the overlap, it is sufficient to perform the 



recursion on the probability P{i, M, y, k), where the addi- 
tional parameter k, < k < K, means that k distinct oc- 
currences of the seed are assumed to occur in BAI (see Fig- 
ure |3}- The modification will mainly concern relation (ii), 
which will read as P{i, M,y, k) — P{i, AI^ ,y, k — 1), 
where means word M without the rightmost letter If 
the overlap between two successive seeds is upper-bounded 
by some constant A (possibly zero, in which case no over- 
lap is allowed), the modification still holds, except that 
should be set to M without A rightmost letters. If the de- 
tection strategy imposes an upper bound on the distance 
between two neighboring seeds, the recursion gets more 
complex, as yet another parameter should be introduced to 
"store" the distance between the closed seed on the left. 

4. Experimental results 

To demonstrate the bias introduced by ignoring the prop- 
erty of homogeneity, we compared the detection proba- 
bilities of different seed strategies on homogeneous vs 
non-constrained alignments of a given score and differ- 
ent lengths. The probabilities for homogeneous alignments 
were computed by the algorithm of Section |3l For gen- 
eral alignments, a simpler version of the algorithm was 
used, that does not account for the homogeneity con- 
straint (details are left out). 

Figurel^shows the results. Each plot represents the prob- 
ability of a certain seed to detect homogeneous/arbitrary 
alignments of a certain score as a function of their length. 
All experiments reported in this section use the default 
BLAST H-1/-3 scoring system. Left and right plots corre- 
spond to score 16 and 32 respectively. The upper row cor- 
responds to the seed 110100110010101111 of weight 11 
and span 18 (implemented in PatternHunter 1 18 1), while the 
lower row corresponds to the contiguous seed of weight 1 1 
(implemented in BLAST 1 LLI). 

For all settings, the results clearly show that ignoring 
the condition of homogeneity leads to a considerable under- 
estimation of the sensitivity. The fraction of homogeneous 
alignments missed is, in most cases, at least two times less 
than what one would expect out of measurements on arbi- 
trary alignments. 

One of the most important applications of measuring the 
sensitivity is the design of optimal spaced seeds for the de- 
tection of alignments of a given type. Therefore, we made 
another group of experiments aiming at comparing the most 
sensitive seeds for homogeneous vs arbitrary alignments. 
Some results are shown in Tabled 

We computed optimal seeds for detecting homogeneous 
and arbitrary alignments of length 40, for several score val- 
ues (between 12 and 24). Some results are shown in Tabled 
They have been obtained by an exhaustive search through 
all seeds of span up to 20. The probabilities were computed 



by the algorithm of Section |3] The table shows that for the 
same parameters (alignment score and seed weight), the op- 
timal seeds are different, depending on whether the optimal- 
ity is defined with respect to all alignments (probability Pa) 
or only homogeneous ones (probability P/,). In each case, 
the highest probability is shown in slanted characters. 



(n,S) 


w 


TTji, {n,Sj 


Ph 


Pa 


(40, 12) 


9 


1110010110111 


0.986271 


0.902372 


(40, 12) 


9 


111001001010111 


0.983516 


0.917869 


(40, 16) 


9 


1110010110111 


0.998399 


0.988887 


(40, 16) 


9 


1100110101111 


0.998353 


0.989535 


(40, 16) 


10 


11101100101111 


0.98742 


0.938499 


(40, 16) 


10 


110110010101111 


0.98740 


0.942769 


(40, 20) 


10 


11101001110111 


0.999172 


0.996303 


(40, 20) 


10 


110110010101111 


0.999065 


0.996555 


(40, 20) 


11 


111011101001111 


0.975462 


0.993076 


(40, 24) 


11 


111010011110111 


0.999891 


0.999661 



Table 1 : Optimal seeds for homogeneous vs arbitrary align- 
ments 



Furthermore, we have performed a similarity search 
based on seeds from Table [l] using YASS software 1191 . 
In particular, we compared the number of alignments 
found using the seed optimized on homogeneous align- 
ments vs the one for arbitrary alignments. Table |2l shows 
the results for the seeds from Tabled of weight 9 and 10 
(7r^(n, S*), respectively TT'^{n,S), stands for the optimal 
seed from Table^computed on homogeneous and all align- 
ments respectively). The experiments were made on com- 
paring full chromosomes IV (1560kb), V (580kb), IX 
(450kb), XVI (960kb) of Saccharomyces Cerevisiae 
against each other or against themselves. Both strands 
of each chromosome has been processed in each experi- 
ment (-r 2 option of YASS). The search was done with 
the "group size" parameter 10 and 11 for seed weight re- 
spectively 9 and 10 (option -s of YASS). The results show 
that, in most cases, the seed tuned for homogeneous aUgn- 
ments allows to identify more relevant similarities than the 
seed optimized for all alignments. 

5. Discussion 

In this paper we presented an approach to measure the 
sensitivity of seed-based similarity search strategies. The 
main point is to compute the hit probability over homo- 
geneous alignments, rather than arbitrary alignments. The 
property of homogeneity requires that the alignment does 
not contain significant negative-score segments occurring 
either inside the alignment (in which case the alignment 
can be decomposed into alignments of higher score), or at 



Figure 4: Seed detection probability on homogeneous vs arbitrary alignments 
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7rg{40,12) 7r9{40,12) 


TTio (40.20) TTio (40,20) 


IX 


519 


519 


502 496 


IX/ V 


364 


356 


342 329 


IX / XVI 


408 


387 


383 348 


IX /IV 


523 


521 


488 473 


V 


500 


487 


477 466 


V/XVI 


961 


955 


937 891 


V/IV 


1273 


1258 


1248 1192 


XVI 


539 


554 


545 510 


XVI / IV 


1429 


1448 


1452 1368 


IV 


1542 


1539 


1510 1461 



Table 2; Number of high-scoring alignments found with a 
seed optimized for homogeneous alignments (left column) 
vs that optimized for all alignments (right column) 



the edges (in which case a subalignment should be consid- 
ered). In this paper, we showed that ignoring this property 
leads to a bias in estimating the detection capacity of seeds. 

Recently proposed approaches to estimate the seed de- 
tection probability |6 3 4| assume a Markov model of 
alignment, that specifies its local composition. The ap- 



proach of this paper is complementary, as we only impose 
global constraints (homogeneity, total score) and abstract 
from local properties of the alignment. If we want to ac- 
count for local properties, the assumption that all fixed- 
score homogeneous sequences are equiprobable would be 
no longer justified. 

Note that one of the drawbacks of the Markov model 
approach of 1 6 3 , 4 1 is that the alignment score is taken 
into account only indirectly, through the expected com- 
position of the alignment (or identity rate, in case of the 
match/mismatch model). This is a serious disadvantage if 
one has to measure the probability on alignments of differ- 
ent length (as in 1 19 1), since in this case the same score gen- 
erally corresponds to different identity rates. The approach 
proposed in this paper is based on the score rather than on 
the identity rate. 

Our analysis has been based on the match/mismatch 
model of alignment. However, sometimes it is very useful to 
distinguish between different mismatches, for example be- 
tween transitions and transversions (see the footnote in Sec- 
tion|3. This approach leads to modeling alignments by se- 
quences on non-binary alphabets (e.g. a three-letter alpha- 
bet of match/transition/transversion). From a pure computer 



science point of view, the results of Sections [2l3l can be ex- 
tended to the non-binary alphabet. However, from the bio- 
logical point of view, the assumption of the uniform distri- 
bution over all sequences becomes unrealistic in this setting, 
as letters are obviously no more "equivalent", and proper- 
ties of alignment composition must be added to the model. 
To conclude, the ultimate approach should take into account 
both global properties of the alignment and its local and 
compositional properties. Designing such an approach re- 
mains a challenging problem. 
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